Abstract


We consider the problem of identifying tandem scattered subse- 
quences within a string. Our algorithm identifies a longest subsequence 
which occurs twice without overlap in a string. This algorithm is based 
on the Hunt-Szymanski algorithm, therefore its performance improves 
if the string is not self similar, which occurs naturally on strings over 
large alphabets. Our algorithm relies on new results for data struc- 
tures that support dynamic longest increasing sub-sequences. In the 
process we also obtain improved algorithms for the decremental string 
comparison problem. 


Keywords: Hunt-Szymanski algorithm, longest increasing sub- 
sequence, Tandem, sub-sequence 


1 Introduction 


In this paper we study longest common scattered sub-sequences (LCSS).
Given two strings P and S, the LCSS is used extensively as a measure of
similarity. In particular we consider a variant of this problem, where the
LCSS must occur twice without overlap in an initial string F. We study
algorithms and data structures that are relevant for this goal. Namely,
we use the Hunt-Szymanski algorithm [1977] and present new results for
data structures that maintain information about the longest increasing
sub-sequence of a dynamic sequence of numbers, together with new
algorithms for the decremental string comparison problem. Specifically,
we obtain the following results:


1. A data structure to maintain the longest increasing subsequence (LIS)
of a dynamic list of numbers. This structure can be used to efficiently:
append a number at the end of the list (Append); remove the current minimum
value from the sequence (ExtractMin); obtain a current longest increasing
subsequence (GetLIS). When the list contains ℓ ≥ 2 elements,
the operation Append requires O(log ℓ) time³. If the size of the LIS
is λ ≥ 2 then the ExtractMin operation requires O(λ log ℓ) time. See
Theorem 1. We further improve these bounds by analyzing batches
of operations and assuming the final sequence is empty. In this case
ExtractMin requires O(λ(1 + log(min{λ, ℓ/λ}))) amortized time per
operation and Append requires O(1) amortized time, when the numbers
are inserted in decreasing sequences of size at least λ elements.
See Theorem 2. This structure uses optimal O(ℓ) space.


2. Using the Hunt and Szymanski [9] reduction from LCSS to LIS we
obtain new bounds for the decremental string comparison problem. In
particular, for a given string F of size n ≥ 1, we show that it is possible
to obtain the LCSS values for all the pairs of strings P and S such
that F = P.S in O(min{n, ℓ}λ(1 + log(min{λ, ℓ/λ})) + nλ + ℓ) time,
where λ ≥ 2 is the size of the LCSS and ℓ ≥ 2 is the number of pairs
of positions in F that contain the same letter. Therefore it is possible
to determine the LTSS within this time, i.e., the LCSS which occurs
twice without overlap in a string F.


2 The Problem 


Let us start by describing the longest tandem scattered sub-sequence (LTSS)
problem of a given string F. We will use a running example with F =
AGCGAACGGGTA. The meaning of tandem is that the sub-sequence needs to
occur twice without overlap in F. Therefore F can be partitioned into a
prefix P and a suffix S, i.e., F = P.S, such that the desired scattered
sub-sequence is a longest common scattered sub-sequence (LCSS) between P
and S. To determine which partition yields the overall largest sub-sequence
it is necessary to test all such partitions.

³ Note that to simplify expressions such as O(1 + log ℓ) we impose
restrictions on parameters such as ℓ ≥ 2. This also avoids invalid
expressions such as when ℓ = 0. In general the complexity of the excluded
cases is O(1).

Let us consider the partition with P = AGCG and S = AACGGGTA. The
LCSS is the longest string that occurs as a scattered sub-sequence of P
and S. Figure 1 illustrates that the string ACG is a longest common scattered
sub-sequence of P and S. A common sub-sequence can be defined as a set
of pairs (i, j) where i is an index over P and j an index over S and the
i-th letter of P is equal to the j-th letter of S, represented as P(i) = S(j).
All numbers i must be distinct among themselves and all numbers j must
be distinct among themselves. Moreover, sorting the pairs by i must also
yield a sequence sorted by j. In our example the LCSS for P and S can be
represented by the set {(1, 1), (3, 3), (4, 4)}.
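The pair-set definition above can be checked mechanically. The following is a minimal sketch, not part of the original algorithm; the function name `is_common_subsequence` is a hypothetical helper introduced here only for illustration, and indices are 1-based as in the text.

```python
def is_common_subsequence(pairs, P, S):
    """Check that a set of 1-based (i, j) pairs encodes a common
    scattered sub-sequence of P and S (hypothetical helper)."""
    ordered = sorted(pairs)  # sort by i, ties broken by j
    for i, j in ordered:
        # Indices must be valid and the letters must match: P(i) = S(j).
        if not (1 <= i <= len(P) and 1 <= j <= len(S)):
            return False
        if P[i - 1] != S[j - 1]:
            return False
    # Sorting by i must give strictly increasing i and strictly increasing j,
    # which also makes all i distinct and all j distinct.
    for (i1, j1), (i2, j2) in zip(ordered, ordered[1:]):
        if not (i1 < i2 and j1 < j2):
            return False
    return True


P, S = "AGCG", "AACGGGTA"
pairs = {(1, 1), (3, 3), (4, 4)}
print(is_common_subsequence(pairs, P, S))           # validity of the example set
print("".join(P[i - 1] for i, _ in sorted(pairs)))  # letters selected from P
```

For the running example the selected letters spell ACG, the sub-sequence discussed in the text.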

Figure 1 also shows an LCSS for a second prefix suffix decomposition,
F = P'.S'. This second decomposition is related to the first as P' = P.A, and
in fact the LCSS is similar to the previous LCSS, with the extra character A,
i.e., ACGA.

In this example the LCSS between P' and S' is the desired overall LTSS.


Figure 1: Example of LCSS for two prefix suffix decompositions of F, F =
P.S and F = P'.S'.


3 Decremental Comparison and Hunt-Szymanski 


In this section we present the main ideas of an algorithm that computes the
LTSS. Given a string F we can reduce this problem to computing the size of
the LCSSs for all prefix and suffix decompositions of F, i.e., for all P.S = F,
where P is a prefix and S is a suffix. The resulting tandem can be obtained
from the overall largest LCSS. The pseudo code for this process is shown
in Algorithm 1, where F(i') represents the i'-th letter of F, i.e., the letter
indexes start at 1.

This process involves n LCSS computations, when the size of F is n.
This computation is referred to in line 8 of Algorithm 1. Each LCSS can be
determined with the classical dynamic programming table between P and S.
Table D is a bi-dimensional array that stores integers. Each value D[i, j]
represents the size of the LCSS between the prefix of P with i letters and
the prefix of S with j letters. The coordinate i ranges from 0 to the size
of P, likewise coordinate j ranges between 0 and the size of S. The value 0
represents the empty prefix.

The values D[i, j] can be computed locally according to the equalities
below, where P(i) denotes the i-th letter of P and S(j) the j-th letter of S:

\begin{align}
D[i,j] &= 0 && \text{if } i = 0 \text{ or } j = 0 \tag{1}\\
D[i,j] &= D[i-1,j-1] + 1 && \text{if } i,j > 0 \text{ and } P(i) = S(j) \tag{2}\\
D[i,j] &= \max\{D[i,j-1],\, D[i-1,j]\} && \text{if } i,j > 0 \text{ and } P(i) \neq S(j) \tag{3}
\end{align}


Let us consider a running example with P = AGCG and S = AACGGGTA. The
values of table D[i, j] are shown in the top portion of Figure 2. For example
the value D[4, 3] is 2, which means that the LCSS between AGCG and AAC
has size 2. This table requires O(n²) time to build, when P and S have O(n)
size. In the example of Figure 2 the desired maximum value is D[4, 8] = 3;
it is this value that is obtained by the GetMaxD operation in Algorithm 1.
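As a baseline, Equations (1) to (3) and the outer loop of Algorithm 1 can be sketched directly. This is a minimal sketch that rebuilds D from scratch for every decomposition, so it takes O(n³) time overall; it is not the efficient procedure developed in this paper. The function names `lcss_table` and `naive_ltss_size` are assumptions made for this illustration.

```python
def lcss_table(P, S):
    """Return D, where D[i][j] is the LCSS size of the first i letters
    of P and the first j letters of S."""
    D = [[0] * (len(S) + 1) for _ in range(len(P) + 1)]   # Equation (1)
    for i in range(1, len(P) + 1):
        for j in range(1, len(S) + 1):
            if P[i - 1] == S[j - 1]:
                D[i][j] = D[i - 1][j - 1] + 1             # Equation (2)
            else:
                D[i][j] = max(D[i][j - 1], D[i - 1][j])   # Equation (3)
    return D


def naive_ltss_size(F):
    """Size of the LTSS of F, trying every decomposition F = P.S and
    recomputing the whole table each time (baseline only)."""
    best = 0
    for cut in range(1, len(F)):
        P, S = F[:cut], F[cut:]
        best = max(best, lcss_table(P, S)[len(P)][len(S)])
    return best


D = lcss_table("AGCG", "AACGGGTA")
print(D[4][3], D[4][8])                  # the two cells discussed in the text
print(naive_ltss_size("AGCGAACGGGTA"))   # size of the LTSS of the running example
```

A baseline of this kind is mainly useful for cross-checking faster implementations on small inputs.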

The LCSSs can be recovered with tracebacks. A traceback is a pointer
from a cell D[i, j] to one of its neighboring cells D[i − 1, j], D[i, j − 1] or
D[i − 1, j − 1]. The resulting paths represent the corresponding LCSSs. The
diagonal tracebacks represent matches between the corresponding strings;
we show only these tracebacks in Figure 2. In our example there is a diagonal
traceback from D[4, 4], representing the fact that both strings end with the
letter G. Let us then consider S' = ACGGGTA and P' = P.A. We also need to
compute the D table for these strings, shown in the middle of Figure 2.
To avoid confusion we refer to this table as D' in the text. In Algorithm 1
all D tables are always represented by D.

We aim to compute a representation of table D' in O(n) time, instead
of O(n²), i.e., we aim to reduce the time bound of the UpdateD operation.
First let us highlight the changes between D and D'. Column D[i, 1] is
removed, which corresponds to removing A from S. Row D'[5, j] is inserted,
which corresponds to appending A to P. Several values are maintained, i.e.,
D[i, j + 1] = D'[i, j]. The remaining values decrease by 1, i.e., D'[i, j] =
D[i, j + 1] − 1. In our example only the values of column D'[i, 0] decrease,
the remaining values are maintained. To determine which cells change and
which remain constant we consider another representation of table D, namely
the representation used in the Hunt-Szymanski algorithm [1977].

Figure 2: Illustration of tables D, D' and D'' with diagonal tracebacks.

Algorithm 1 SizeOfLTSS(F)
 1: n ← |F|
 2: P ← ε                  ▷ Empty string.
 3: S ← F
 4: x ← 0                  ▷ Initial value for size of LTSS.
 5: for i' = 1, ..., n − 1 do
 6:     P ← P.F(i')        ▷ Appends letter to P.
 7:     Pop(S)             ▷ Removes the first letter of S.
 8:     UpdateD(F(i'))     ▷ Update D according to the change in P and S.
 9:     λ ← GetMaxD()      ▷ λ becomes the size of the corresponding LCSS.
10:     if x < λ then
11:         x ← λ          ▷ Found a new maximum.
12:     end if
13: end for
14: return x               ▷ Size of the overall LTSS.

Let us now focus on how to efficiently compute decremental string
comparison, i.e., a simple and efficient way to obtain table D' from table D,
which is the process we will use in line 8 of Algorithm 1. Let us start by
reviewing and augmenting the Hunt-Szymanski algorithm [9]. The algorithm
works by reducing the LCSS to the problem of determining a longest
increasing sub-sequence of numbers (LIS). This reduction is illustrated in
Figure 3. It works in two steps. In the first step it processes S. For every
letter b in S it computes the list of positions where b occurs in S,
represented by M_S(b). In the second step it processes P, from left to right,
and produces a list of numbers P_S. For every letter b of P the list M_S(b)
is appended to the current list of numbers.

The resulting list P_S consists of a list of positions of S, where the same
position may appear several times. Hence selecting a subsequence from P_S
is equivalent to choosing letters from S and P simultaneously. In our
example a resulting longest increasing subsequence is 1 < 3 < 4. Selecting
these letters from S yields the desired common subsequence ACG. To avoid
selecting the same letter from S repeatedly the LIS needs to be strictly
increasing. Moreover, to guarantee that a letter from P is selected only once
the lists M_S(b) are sorted in decreasing order and this order is used to
build P_S.
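The two steps of the reduction can be sketched as follows. This is only an illustrative sketch; the names `match_lists` and `reduction_list` are hypothetical and positions are 1-based, as in the text.

```python
from collections import defaultdict


def match_lists(S):
    """First step: M_S(b), the 1-based positions of every letter b in S,
    stored in decreasing order."""
    M = defaultdict(list)
    for pos in range(len(S), 0, -1):   # scan S from right to left
        M[S[pos - 1]].append(pos)
    return M


def reduction_list(P, S):
    """Second step: P_S, obtained by appending M_S(b) for every letter b
    of P, from left to right."""
    M = match_lists(S)
    L = []
    for b in P:
        L.extend(M.get(b, []))         # letters absent from S contribute nothing
    return L


print(reduction_list("AGCG", "AACGGGTA"))
```

For P = AGCG and S = AACGGGTA the resulting list is 8, 2, 1, 6, 5, 4, 3, 6, 5, 4, and the strictly increasing subsequence 1 < 3 < 4 mentioned above can be read off it.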

The Hunt-Szymanski algorithm then proceeds to efficiently compute
the LIS. In this context we represent the list of numbers by L, abstracting
away the process that was used to produce it, i.e., L = P_S. We will use ℓ to
refer to the size of L. To determine the LIS the algorithm uses a sequence
of threshold lists T_k. List T_k contains the element i of L if the longest
increasing subsequence which ends with i, among the elements of L up to i,
has size k. The top box of Figure 4 shows this threshold structure for the
sequence L we are considering. In this example we can observe that T_2
contains the value 6, which corresponds to the first 6 in L. This occurs
because the LIS up to that element has size 2, namely it could be 2, 6. In

Small Longest Tandem Scattered Subsequences 85 


Figure 3: The Approach


another example we can observe that T_3 also has a value 6; this corresponds
to the second 6 in L and is in this list because the LIS up to this element
has size 3, namely it could be 2, 4, 6.


Now let us return to the decremental string problem and study how this
data structure is affected when P changes to P' and S changes to S'. The
top part of Figure 2 shows the dynamic programming table D, for P and S.
The figure also illustrates D' and D'' for the consecutive decompositions that
append letters to P and remove letters from the beginning of S. This figure
serves to illustrate the relation between the D table and the T_k lists.
Figure 2 shows only diagonal tracebacks, as these are the only ones that
appear in the T_k lists. For example, if we consider the cells in D that are
equal to 2, the list T_2 gives a representation of this set. The cells (2, 6),
(2, 5), (2, 4) and (3, 3) are the respective diagonal tracebacks. The T_k lists
store only the j coordinates, therefore the list T_2 contains 6 > 5 > 4 > 3.
In general the list T_k stores, in decreasing order, the j coordinates of the
cells with D value k and diagonal tracebacks, i.e., there is an i such that
D[i, j] = k and P(i) = S(j).

Figure 4: Dynamic LIS computation. The top box shows the T_k lists for P
and S.


Computing the threshold structure is done incrementally by processing
the elements of L from left to right. Therefore the original Hunt-Szymanski
algorithm already supports the Append operation, which updates the
structure when a new number is appended. For completeness and later analysis
we present this process in Algorithm 3. Each list T_k is stored in decreasing
order. With this organization the sequence of tail elements is kept in
increasing order throughout the execution of the algorithm. After processing
our sample list L the resulting sequence of tail elements is 1 < 3 < 4.
This order is used to determine in which T_k a given element of L should
be inserted. Let us consider our running example and start with all the T_k
lists empty. The first 8 initializes T_1. The 2 is also appended to T_1 because
8 > 2, likewise 1 is also appended to T_1 because 2 > 1. Number 6 initializes
list T_2, because 1 < 6, which becomes the current sequence of tail elements.
Numbers 5, 4 and 3 are also appended to T_2, as they are all greater than 1
and in decreasing order. The number 6 initializes list T_3, because 1 < 3 < 6.
Likewise 5 and 4 are also appended to T_3. Since list T_3 is not empty, we
know that L contains an increasing subsequence of length 3. Since T_4 was
left empty, there is no increasing subsequence of length 4. Therefore the
size of a LIS in our example is 3.
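The construction just traced can be sketched with plain Python lists. This is a minimal sketch that only illustrates the threshold lists, it is not the balanced-tree structure analyzed later; the name `threshold_lists` is hypothetical, and `bisect_left` from the standard library performs the search over the tail elements.

```python
from bisect import bisect_left


def threshold_lists(L):
    """Build the T_k lists for L; T[k-1] holds T_k in decreasing order."""
    T = []       # T[k-1] is the list T_k
    tails = []   # tails[k-1] is the tail (minimum) of T_k, strictly increasing
    for x in L:
        k = bisect_left(tails, x)    # first list whose tail is >= x
        if k == len(T):
            T.append([x])            # x initializes a new list: the LIS grows
            tails.append(x)
        elif tails[k] != x:          # repetitions of the tail are not stored
            T[k].append(x)
            tails[k] = x
    return T


T = threshold_lists([8, 2, 1, 6, 5, 4, 3, 6, 5, 4])
print(T)                       # the lists T_1, T_2, T_3
print([t[-1] for t in T])      # the sequence of tail elements
print(len(T))                  # size of a LIS
```

On the sample list this reproduces the lists of the walk-through, with tail elements 1 < 3 < 4 and a LIS of size 3.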


4 Efficient Dynamic LIS 


We now focus on adapting the threshold structure to obtain the longest
subsequence which occurs twice without overlap in a string F. Consider
F = AGCGAACGGGTA, in which case the subsequence could be ACGA, which
has size 4. To obtain this value we execute Algorithm 1 using the threshold
structure to implement line 8, meaning that we aim to update D to reflect
the changes in P and S. The procedure we propose to perform this process
is sketched in Algorithm 2. It requires two primitives from the dynamic LIS
data structure. The ExtractMin operation removes the minimum element
of L and the Append(i) operation appends the number i to the end of L.
The rationale for these operations is that Append is used to update D when
a letter is added to P, and removing the letter from S implies removing
the minimum position from L and from M_S(c), where c is the letter being
removed, i.e., c = F(i'). Note that this latter operation can be done in
O(1) time because in M_S(c) the minimum occurs at the last position of the
list. Appending the letter c to P corresponds to increasing the list L by
appending the new M_S(c) positions to L; this is performed by the for cycle
in Algorithm 2.

Algorithm 2 UpdateD(c)
1: if 0 < ℓ then             ▷ No extraction if L is empty.
2:     ExtractMin()
3: end if
4: Trim(M_S(c))              ▷ Remove the min also from M_S(c).
5: for all i ∈ M_S(c) do
6:     Append(i)
7: end for

Hence the crucial component to obtain an efficient implementation
of Algorithm 1 is a data structure that can compute the Append and
ExtractMin operations quickly. Assume that we have the T_k lists for the
strings P and S. We aim to update this structure for strings P' = P.A and
S' = ACGGGTA, i.e., we want to append a letter to P and remove the first
letter from S. Removing the first letter from S changes the M_S(b) lists,
in particular all the positions are offset by 1, for example M_S'(G) = 5, 4, 3.
This offset does not alter the relative order of the numbers nor the shape
of the threshold structure. Therefore we ignore this offset, to simplify the
exposition, and also to reduce the complexity of the algorithm. Instead
assume that we start numbering the positions of S' at 2. Now the only change
is to M_S'(A) = 8, 2, which loses position 1, as it is no longer part of S'. In
general the position that gets removed is the overall minimum. Hence we
need to apply an ExtractMin operation to the threshold structure. This
operation should remove all the instances of the minimum in L, in this case
all the instances of 1. In this particular example there is only one instance,
but in general there can be several such occurrences. Algorithm 4 shows
the precise pseudo-code for this operation; let us now illustrate it with our
running example.

Due to the decreasing order of the T_k lists and increasing order of the
tail elements it is straightforward to locate the overall minimum element.
The minimum is always the tail element of T_1. Notice that even if there
are several instances of the minimum in L there is only one element in T_1,
because of our approach of discarding duplicated elements. Now remove this
element from T_1. The resulting structure still maintains the necessary orders
as the sequence of the tail elements becomes 2 < 3 < 4, and the internal
order of the T_k lists was not altered. This structure is shown in the second
box of Figure 4. This is indeed the same structure that can be obtained
from the sequence 8, 2, 6, 5, 4, 3, 6, 5, 4. Thus, in this case, no further work
is required.

Since P' contains an extra A, we need to append the list M_S'(A) to L.
Therefore we execute the operations Append(8) and Append(2). This alters
the T_k lists, as explained above. Appending the number 8 initializes T_4,
because 2 < 3 < 4 < 8. Appending the number 2 does not produce any
change because it is already the tail of T_1 and we do not store repetitions in
the T_k lists. In this case we simply drop the element; Section 4.1 describes
a more elaborate process that is used when the size of the LIS is not
enough and we want to retrieve an actual such sequence. At this point
we have obtained a LIS of size 4, which identifies an LCSS of P' and S' that
is our goal subsequence of F. However, to make sure this is indeed the
longest subsequence we must continue Algorithm 1 and update the T_k lists
for P'' = P'.A and S'' = CGGGTA. Again we begin by computing ExtractMin.
Notice that, because we do not store repetitions, this procedure removes
both instances of the number 2. This time the operation is more elaborate
because after removing the 2 from T_1 the resulting sequence of tail elements
is no longer increasing. Note that 8 is the tail of T_1 and 3 is the tail of T_2
and 8 > 3. To solve this problem we could transfer the 3 from T_2 to T_1,
thus fixing the first inequality as 3 < 4. However this is not correct. Note
that at this point the sequence L we are considering is 8, 6, 5, 4, 3, 6, 5, 4,
in which case T_1 should be 8 > 6 > 5 > 4 > 3. Therefore the correct
procedure is to remove all the elements from T_2 and append them to T_1.
Now T_2 becomes empty so all the elements from T_3 are moved to T_2, which
leads T_3 to become empty and therefore all the elements from T_4 are moved
to T_3. Hence T_4 becomes empty and the process terminates because T_4 was
the last list.

The general procedure for ExtractMin is to remove the tail element
from T_1 and then transfer from T_2 to T_1 all the elements that are smaller
than the current tail element. The process continues from T_{k+1} to T_k until
there are no further elements to transfer, either because T_{k+1} is empty or all
its elements are larger than the current tail element of T_k. This process is
summarized in Algorithm 4. In Section 4.1 we formalize, extend and analyze
this data structure. Our example finishes by appending 8, which, because
we do not store repetitions in our T_k lists, gets discarded and therefore
does not alter the structure. For P'' and S'' the resulting LIS has size 3
and is therefore smaller than the subsequence ACGA obtained for P' and S'.
This was in fact a desired subsequence, but the algorithm must scan the
remaining pairs of prefixes and suffixes to certify this conclusion.
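The walk-through above can be reproduced with a list-based sketch of Append, ExtractMin and the outer loop of Algorithms 1 and 2. Several choices here are assumptions of this sketch and not the paper's implementation: plain Python lists replace the balanced trees of Section 4.1, so the stated time bounds do not apply; an element equal to the current tail is also transferred and then dropped as a repetition; and ExtractMin is only called when the removed position is actually stored. The names `append`, `extract_min` and `ltss_size` are hypothetical.

```python
import math
from bisect import bisect_left
from collections import defaultdict


def append(T, x):
    """Append x to L; T[k-1] is the list T_k in decreasing order."""
    tails = [t[-1] for t in T]       # recomputed for simplicity
    k = bisect_left(tails, x)
    if k == len(T):
        T.append([x])                # the LIS grows
    elif tails[k] != x:              # repetitions are not stored
        T[k].append(x)


def extract_min(T):
    """Remove the minimum of L (the tail of T_1) and repair the lists."""
    T[0].pop()
    k = 1
    while k < len(T):
        low = T[k - 1][-1] if T[k - 1] else math.inf   # tail of the previous list
        if low < T[k][-1]:
            break                    # tails are increasing again
        cut = len(T[k])              # elements <= low form a suffix of T[k]
        while cut > 0 and T[k][cut - 1] <= low:
            cut -= 1
        moved = T[k][cut:]
        del T[k][cut:]
        if moved and T[k - 1] and moved[0] == T[k - 1][-1]:
            moved = moved[1:]        # drop a repeated value
        T[k - 1].extend(moved)
        k += 1
    while T and not T[-1]:
        T.pop()                      # the LIS shrinks


def ltss_size(F):
    """Size of the LTSS of F, following the structure of Algorithms 1 and 2."""
    M = defaultdict(list)
    for pos in range(len(F), 0, -1):
        M[F[pos - 1]].append(pos)    # decreasing positions of each letter
    T, best = [], 0
    for cut in range(1, len(F)):     # P = F[1..cut], S = F[cut+1..n]
        c = F[cut - 1]
        if T and T[0][-1] == cut:    # position cut leaves S
            extract_min(T)
        M[c].pop()                   # Trim: cut is the last entry of M[c]
        for pos in M[c]:
            append(T, pos)
        best = max(best, len(T))
    return best


print(ltss_size("AGCGAACGGGTA"))
```

Because positions of F are used directly, the lists in this sketch are shifted by 4 with respect to the example in the text, which numbers the positions of S from 1.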


4.1 Implementation and Analysis 


First let us discuss which data structures can be used to efficiently store
the threshold data structure. In the classical Hunt-Szymanski algorithm
each T_k list can be stored in a stack, where reading the Top element and
pushing new elements can be achieved in constant time. There is no need to
pop elements from the stacks, so it is enough to store the Top values. These
values are stored in an array so that it is possible to perform a binary search
on the top elements. The procedure to execute Append(i) is to execute a
binary search on the array to find k such that Top(T_{k−1}) < i ≤ Top(T_k).
If T_k is empty assume its stack top is +∞, also assume there is a sentinel
list T_0 with Top(T_0) = −∞. If for the resulting k we have Top(T_k) = i
then the procedure stops, otherwise it performs Push(T_k, i).


To support the ExtractMin operation we prefer to use a different data
structure. We represent the T_k lists using balanced binary search trees
(BST), in particular red-black trees. This allows us to compute Min(T_k),
Insert(T_k, i), Remove(T_k, i), Predecessor(T_k, i), Split(T_k, v) and
Concatenate(T_k, v) in O(log ℓ) time, where ℓ is the size of L. Like the
Hunt-Szymanski algorithm, we keep an array Min[k] that stores the tail
element of T_k, so that it can be accessed in constant time. The Min(T_k)
operation finds the smallest element in T_k. When the BST of T_k is empty
it returns +∞. The Insert(T_k, i) operation inserts the number i into the
BST of T_k. The Remove(T_k, i) operation removes the number i from the
BST of T_k; if key i does not exist then an error is reported and the current
process is stopped. The Predecessor(T_k, i) operation finds the largest
element of T_k that is less than or equal to i, i.e., max{j ∈ T_k | j ≤ i}; the
result should be a pointer to the corresponding tree node v, and if no such
node exists the pointer should be NULL. The Split(T_k, v) operation divides
the BST of T_k in two by keeping all the nodes with keys strictly larger
than v in T_k and putting v and the remaining nodes in a new BST. The
operation Concatenate(T_k, v) joins the BST containing node v in front
of the BST of T_k, assuming that all the key values in T_k are larger than or
equal to the key in v and v is the maximum key value in its BST. Recall
that we assume that the values in T_k are not repeated, therefore the Insert
and Concatenate operations drop duplicated elements when they occur.
Algorithms 3 and 4 show the pseudo-code for the Append and ExtractMin
operations, respectively. Note that for the Append procedure the Min[k]
array plays the role of the Top operation in the classical version.


Let us now analyze the time performance of the Append procedure,
Algorithm 3. Without the Min[k] array the overall time would be
O((log ℓ)(log λ) + log ℓ), where the first term accounts for the binary search
in lines 4 to 10. The second term accounts for the Insert operation in
line 12. Using the Min[k] array the first term reduces to O(log λ) and thus
the overall time becomes O(log ℓ) because λ ≤ ℓ, since λ is the size of a
subsequence of L.

Now let us analyze the ExtractMin procedure, Algorithm 4. The while
loop executes at most λ times. Each execution requires O(log ℓ) time for
the Predecessor, Split and Concatenate operations. Hence we obtain
a bound of at most O(λ log ℓ) time. However an even tighter bound is
possible. This operation can be bounded by O(Σ_{k=1}^{λ} log(|T_k|)), where
|T_k| is the size of the list T_k. Because the log function is concave and the
size of all the lists adds up to ℓ we can use Jensen's inequality [1906] to
obtain an O(1 + λ log(ℓ/λ)) bound. The following derivation justifies the
bound.


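\begin{align*}
\sum_{k=1}^{\lambda} \log(|T_k|) &= \lambda \sum_{k=1}^{\lambda} \frac{1}{\lambda} \log(|T_k|) \\
&\le \lambda \log\left(\sum_{k=1}^{\lambda} \frac{|T_k|}{\lambda}\right) \\
&= \lambda \log(\ell/\lambda)
\end{align*}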
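The inequality can be checked numerically on concrete list sizes. The sketch below is only an illustration; the helper name `jensen_sides` is hypothetical and base-2 logarithms are an assumption, since the text does not fix the base.

```python
import math


def jensen_sides(sizes):
    """Return (sum of log|T_k|, lambda * log(l / lambda)) for the list sizes."""
    lam = len(sizes)                           # number of non-empty lists
    ell = sum(sizes)                           # total number of stored elements
    lhs = sum(math.log2(s) for s in sizes)
    rhs = lam * math.log2(ell / lam)
    return lhs, rhs


# Sizes of T_1, T_2 and T_3 in the running example: 3, 4 and 3 elements.
lhs, rhs = jensen_sides([3, 4, 3])
print(lhs <= rhs)
```

The two sides coincide when all lists have the same size, which is the worst case for the sum of logarithms.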


This finishes the dynamic LIS contribution, which is summarized in 
the next Theorem. 


Theorem 1 It is possible to maintain a dynamic list with ℓ ≥ 2 numbers
such that the Append operation can be computed in O(log ℓ) time and
ExtractMin and GetLIS require O(1 + λ log(ℓ/λ)) time, for a longest
increasing sub-sequence of size λ.


The Append and ExtractMin operations have just been described in this
section. The GetLIS operation is described in Appendix A.

Algorithm 3 Append(i)
Ensure: Updated threshold structure for L with i appended.
 1: ℓ ← (ℓ + 1)               ▷ Increase size of L.
 2: j ← 0
 3: k ← (λ + 1)
 4: while j + 1 < k do
 5:     m ← ⌊(j + k)/2⌋
 6:     if i > Min[m] then
 7:         j ← m
 8:     else
 9:         k ← m
10:     end if
11: end while
12: Insert(T_k, i)
13: Min[k] ← i
14: if k = (λ + 1) then
15:     λ ← (λ + 1)           ▷ LIS grows
16: end if

This result establishes some initial bounds for this data structure. However
these bounds are not competitive for our goals. To determine
an LTSS we might generate a sequence with ℓ = O(n²) elements and
perform O(ℓ) Append operations and n ExtractMin operations. This yields
an O(n² log n) time algorithm. Let us improve the performance of the
dynamic LIS data structure. First we change the red-black BSTs to finger
trees [6, 8, 7]. This means that the Split and Concatenate operations that
involve the t_i ≥ 2 tail elements of T_k require only O(log t_i) amortized time,
instead of O(log |T_k|) time. Let us consider the overall algorithm, from the
initial empty structure to the final one. We will analyze the overall time
that is used to process a given list T_k. The following argument applies for
any k but for simplicity consider that we are analyzing T_1. We have the
following inequality:


\[ \sum_{i=1}^{n} t_i \le n + (\lambda - 1)n = \lambda n \tag{4} \]


The left term in the inequality counts the number of elements that are moved
from T_1. The right side counts the number of elements that are removed
from the data structure. The term n counts the number of elements that
are actually removed from T_1, one for each ExtractMin operation. The
term (λ − 1)n accounts for the elements that are dropped in the middle of
the data structure. In each ExtractMin operation at most (λ − 1) elements
are dropped, one for each T_k list, except for T_1. Now the total time of
these operations is O(Σ_{i=1}^{n} log t_i). We can bound this value, subject to
Equation (4), by using Lagrange multipliers. We consider only one Lagrange
multiplier, represented by c, because we have only one restriction. Hence
the resulting Lagrangian expression is the following:


\[ \sum_{i=1}^{n} \log t_i - c\left(\sum_{i=1}^{n} t_i - \lambda n\right) \]
Algorithm 4 ExtractMin()
Require: L is not empty
Ensure: Updated threshold structure for L without the current minimum.
 1: ℓ ← (ℓ − 1)                              ▷ Decrease size of L.
 2: Remove(T_1, Min[1])
 3: k ← 2
 4: if ℓ > 0 then
 5:     while Min(T_{k−1}) > Min[k] do       ▷ Assuming Min[λ + 1] = +∞
 6:         v ← Predecessor(T_k, Min(T_{k−1}))
 7:         Split(T_k, v)
 8:         Concatenate(T_{k−1}, v)
 9:         Min[k − 1] ← Min[k]
10:         k ← k + 1
11:     end while
12: end if
13: Min[k − 1] ← Min(T_{k−1})
14: if Min[λ] = +∞ then
15:     λ ← (λ − 1)                          ▷ LIS shrinks
16: end if

A derivative in order of t_i yields the following condition:

\[ \frac{1}{t_i} = c \tag{5} \]

The derivative in order of c returns the original restriction:

\[ \sum_{i=1}^{n} t_i = \lambda n \]

Combining both equations we obtain that c = 1/λ and therefore log t_i =
log(λ). If we use the same upper bound for all the other T_k lists we obtain
O(nλ log λ) total time for n ExtractMin operations. This yields an amortized
time of O(λ log λ) per operation, provided the final structure is empty.
This new bound for ExtractMin is not necessarily smaller than the previous
O(λ log(ℓ/λ)), but the best of both applies.

Besides this bound for ExtractMin we also need a faster Append
operation. Using the amortized performance of the finger tree data structure
we obtain an O(log λ) amortized bound for the Append operation. This
performance can be further improved by considering how it is called from
Algorithm 2. Notice that this operation is repeatedly called from the for
loop in line 5 to append the elements in M_S(c). Moreover these elements
are in decreasing order of value; this fact can be exploited to obtain another
time bound. With the previous bound the for loop in line 5 would require
O(m log λ) time, when M_S(c) contains m elements. By exploiting the fact
that the elements in M_S(c) are ordered we can obtain a bound of O(m + λ)
instead. To obtain this bound do a simple linear scan from T_λ down to the
desired position, instead of a binary search. Only reset the search if
necessary. A sequence of decreasing numbers is referred to as a batch. During
a batch the position k of the scan is not reset. This means that processing
a batch containing m numbers requires only O(m + λ) time. Note that
this is O(1) amortized time per number, when the batch contains at least λ
numbers. The pseudo code for the AppendBatch procedure is shown in
Algorithm 5, where line 8 is computed in O(1) amortized time with the finger
tree data structure and line 5 accumulates to O(λ) in a decreasing sequence.
Note that the local variable k preserves its value among successive calls.

Algorithm 5 AppendBatch(i)
Ensure: Updated threshold structure for L with i appended.
1: ℓ ← (ℓ + 1)                        ▷ Increase size of L.
2: if Min[k] < i then                 ▷ Value of k is maintained between calls.
3:     k ← λ + 1
4: end if
5: while k > 1 and Min[k − 1] ≥ i do
6:     k ← k − 1
7: end while
8: Insert(T_k, i)
9: λ ← max(λ, k)
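The batch idea can be sketched with plain lists as follows. This is a minimal sketch under stated assumptions: Python lists replace finger trees, so only the scanning behaviour is illustrated and not the amortized bounds; the class name `BatchAppender` and its helper `_tail` are hypothetical; repetitions of a tail are dropped as in the rest of the section.

```python
import math


class BatchAppender:
    """Threshold lists whose scan position k is kept between calls."""

    def __init__(self):
        self.T = []   # self.T[k-1] is T_k, stored in decreasing order
        self.k = 1    # 1-based scan position, preserved across calls

    def _tail(self, k):
        # Tail of T_k, with +infinity standing for lists beyond the last one.
        if k > len(self.T):
            return math.inf
        return self.T[k - 1][-1]

    def append(self, x):
        if self._tail(self.k) < x:          # x does not continue the batch
            self.k = len(self.T) + 1        # reset the scan to the top
        while self.k > 1 and self._tail(self.k - 1) >= x:
            self.k -= 1                     # linear scan downwards
        if self.k > len(self.T):
            self.T.append([x])              # the LIS grows
        elif self.T[self.k - 1][-1] != x:   # repetitions are not stored
            self.T[self.k - 1].append(x)


structure = BatchAppender()
for x in [8, 2, 1, 6, 5, 4, 3, 6, 5, 4]:    # batches 8,2,1 / 6,5,4,3 / 6,5,4
    structure.append(x)
print(structure.T)
```

Within a decreasing batch the scan position only moves downwards, which is where the O(m + λ) cost per batch comes from.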

We can now summarize our dynamic LIS data structure in the following
theorem:


Theorem 2 Consider the Longest Increasing Sub-Sequence of a dynamic
list of numbers, which starts and finishes empty. Assuming that in total ℓ
elements are inserted into the structure, in d batches of decreasing sequences,
and also that the ExtractMin operation is executed e times in total, then
the overall time for this is bounded by O(eλ(1 + log(min{λ, ℓ/λ})) + ℓ + dλ),
where λ ≥ 1 is the size of the largest overall LIS. At any time the size of the
current LIS can be obtained in O(1) time.


We can now combine the results of Theorems 1 and 2 to obtain our
bounds for the decremental string comparison problem.


Theorem 3 Given strings P and S there exists a data structure that can
be used to obtain the size of the LCSS between these strings, λ ≥ 2, which
requires O(ℓ) space, where ℓ is the number of matches between P and S. This
structure can be updated to the strings P.c and S in O(λ + |S|) time, where c
is any letter. It can also be updated for the strings P and S', where S = c.S',
in O(λ(1 + log(ℓ/λ))) time. A sequence of operations that starts with an
empty string and inserts letters to form the string P requires O(|P|λ + ℓ)
time. A sequence of operations that decrements S until it becomes empty
requires O(min{|S|, ℓ}λ(1 + log(min{λ, ℓ/λ})) + |S|) time.


The amortized complexities follow from Theorem 2, and the extra min
that appears is a bound on the number of ExtractMin operations, $e$ in
Theorem 2. This number of operations is bounded simultaneously by $|S|$
and by $\ell$, because we cannot remove more points than the ones that exist
inside the structure. However, in the case where $e < |S|$ it is necessary to
add an $O(|S|)$ term. This corresponds to the case where the letter $c$ that
is being removed from $S$ has no occurrences in $P$. In this case there is no
call to the ExtractMin operation, but this verification still needs to be
performed, which requires $O(1)$ time and must be accounted for.

Our application of computing the LTSS now follows from Theorem 3.
The total time of the LTSS algorithm is therefore
$O(\min\{n, \ell\}\lambda(1 + \log(\min\{\lambda, \ell/\lambda\})) + n + \ell)$,
where $\lambda > 2$ is the size of the LTSS and $\ell > 2$ is
the number of pairs of positions in $F$ that contain the same letter.
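As a point of reference for the definition only, the LTSS of $F$ is the largest LCSS between a prefix $P$ and the remaining suffix $S$ over all ways of splitting $F = P.S$. The following minimal sketch computes it with the plain dynamic programming table for every split, which takes cubic time; it is a naive baseline and not the algorithm analysed in this paper. The function names `lcss_length` and `ltss_length` are illustrative and do not come from the paper.

```python
def lcss_length(p: str, s: str) -> int:
    """Length of a longest common scattered subsequence (plain O(|p||s|) DP)."""
    prev = [0] * (len(s) + 1)
    for a in p:
        cur = [0]
        for j, b in enumerate(s, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(s)]


def ltss_length(f: str) -> int:
    """Try every split f = P.S and keep the best LCSS(P, S)."""
    return max((lcss_length(f[:i], f[i:]) for i in range(1, len(f))), default=0)
```

For instance, `ltss_length("abab")` considers the split `ab|ab` and reports a tandem subsequence of size 2.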


5 Related Work 


An initial efficient algorithm to compute the LTSS, for the simple case of
only one string $F$, was given by Kosowski [15]. This algorithm required
optimal $O(n^2)$ time and $O(n)$ space. Tiskin [20, Section 5.6] presented
an algorithm which obtains the smallest worst case bound by exploiting
the Monge properties of the respective distance matrices. This property
depends on the fact that the graph underlying the $D$ table of two strings is
planar. The resulting algorithm obtains the overall worst case time bound
of $O(n^2 (\log\log n)^2/(\log n)^2)$.

The work on incremental string comparison was initiated by Landau,
Myers, and Schmidt [16], who obtained an $O(n)$ time algorithm to obtain
$D'$ from $D$. A simpler version, with the same performance, was presented
by Kim and Park [14], which is simultaneously incremental and
decremental. This is the first instance of the decremental variation of the
problem. This solution was presented for the edit distance. Ishida, Inenaga,
Shinohara, and Takeda [11] presented an algorithm which reduced the time
complexity from $O(n)$ to $O(\lambda)$ and was fully incremental. The
algorithm was presented for the LCSS and they also reduced the space
requirements from $O(n^2)$ to $O(n\lambda)$.


Landau, Myers, and Ziv-Ukelson [17] studied the consecutive suffix
alignment problem, which asks for the size of the LCSS between
all the suffixes of a string $A$ and a string $B$; the final version of the
paper appeared in [2007]. The authors presented two algorithms for this
problem, which required $O(n\lambda)$ and $O(n\lambda + n\log\sigma)$ time,
where $\sigma$ is the size of the alphabet of the underlying strings. Their
approach uses a structure similar to the $T_k$ lists from the Hunt-Szymanski
algorithm, but contrary to our approach of Section 3 the elements are
prepended to a variation of the $T_k$ lists. Moreover their structure is not
decremental. Because of these nuances the relation to LTSS is not immediate,
which justifies the algorithm of [15], in the same year.


A cornerstone of all these results is the algorithm from Hunt and
Szymanski [9], whose crucial idea was the reduction from the LCSS to the LIS,
although this was not immediately clear in the original presentation. It was
partially identified by [1], [2] and made explicit by [12] and independently
by [19]. Interestingly the original presentation of Hunt and Szymanski [9]
reported an $O((n+\ell)\log\ell)$ time bound, where $\ell$ is the size of the
sequence $L$. This is a significant improvement over the plain dynamic
programming algorithm, which always requires $O(n^2)$ time. Although in the
worst case $\ell$ may be $O(n^2)$, in general it may be significantly smaller.
The original complexity was not always faster than the plain algorithm,
because $\ell$ may be $\Omega(n/\log n)$. This issue was addressed by
Apostolico [1], who obtained $O(n^2)$ time worst case guarantees. Their
algorithm already considered using finger trees to represent the $T_k$ lists.
Improvements of the Hunt-Szymanski algorithm based on bitwise operations were
proposed by Crochemore, Iliopoulos, and Pinzon [5].
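The reduction from the LCSS to the LIS can be sketched as follows. For each letter of $P$, in order, the matching positions in $S$ are listed in decreasing order; the resulting sequence plays the role of $L$, its length is $\ell$, and the size of a strictly increasing subsequence of maximum length equals the size of the LCSS. The sketch below is a static illustration that keeps only the minimum of each threshold level in a sorted array; the names `match_sequence`, `lis_length` and `lcss_length_hs` are assumptions made for this example and it does not use the finger trees discussed in the paper.

```python
from bisect import bisect_left
from collections import defaultdict


def match_sequence(p: str, s: str) -> list[int]:
    """Sequence L: for each letter of p, its positions in s in decreasing order."""
    occ = defaultdict(list)
    for j, c in enumerate(s, start=1):
        occ[c].append(j)
    return [j for c in p for j in reversed(occ.get(c, []))]


def lis_length(seq: list[int]) -> int:
    """Size of a strict LIS; mins[k] is the minimum of threshold level k + 1."""
    mins: list[int] = []
    for x in seq:
        k = bisect_left(mins, x)  # first level whose minimum is >= x
        if k == len(mins):
            mins.append(x)        # x extends the longest sequence found so far
        else:
            mins[k] = x           # x becomes the new minimum of that level
    return len(mins)


def lcss_length_hs(p: str, s: str) -> int:
    return lis_length(match_sequence(p, s))
```

Listing the matches of one letter in decreasing order is what prevents two matches of the same letter of $P$ from appearing in one increasing subsequence.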


A data structure that supports dynamic longest increasing sub-sequences
was presented by Chen, Chu, and Pinsker [4]. The focus is
on supporting insertions anywhere in the sequence, which is achieved in
$O(1 + \lambda\log(\ell/\lambda))$ time. The authors obtain one corresponding
LIS in $O(\lambda + \log\ell)$ time. This is more efficient than the procedure
we explain before Theorem 1, however our procedure can be used to obtain all
the sub-sequences, whereas their approach obtains only one. Their data
structure is similar to the one we present, which is expected as both are
related to the Hunt-Szymanski algorithm. Chen et al. [4] use level key lists
$L_k$, which are similar to our $T_k$ lists, but store index value pairs and
are sorted by increasing index. This is similar to the structure we use for
the GetLIS operation, but the lists are flattened, instead of storing the
indexes in a second structure. Moreover they also use red-black trees to split
and concatenate lists and also mention exploring fingering properties of the
structure. The presentation mentions deletions but the focus is on insertions.
It seems plausible their representation could also support deletions
efficiently.

The most recent approach for computing the LTSS was proposed by Inoue,
Inenaga, and Bannai [10]. Their algorithm is very similar to the one we
present in this paper. They also reduce the problem to a dynamic LIS problem
and used the data structure of Chen, Chu, and Pinsker [4] to obtain a
complexity of
$O(\min\{n, \ell\}\lambda(1 + \log(\ell/\lambda)) + n + \ell\log n)$.
In the next section we explain how our algorithm improves upon their result
and discuss future possible improvements.


6 Conclusions 


In this section we recall and discuss the contributions of the paper in context. 
In this paper we presented a new algorithm to determine the longest tandem 
scattered sub-sequence of a string $F$. In the process we introduced the
decremental string comparison problem and provided new data structures 
to support dynamic LIS sequences. We studied a dynamic version of the 
Hunt-Szymanski algorithm, which yielded several interesting results. 

Considering the LTSS problem itself, the strongest worst case bounds
were obtained by Tiskin [20] with an $O(n^2(\log\log n)^2/(\log n)^2)$ time
bound. Both this algorithm and the one by Kosowski [15] seem to have the
average case with the same bound as the worst case.

The algorithm we obtain as a consequence of Theorem 2 obtains
$O(\min\{n, \ell\}\lambda(1 + \log(\min\{\lambda, \ell/\lambda\})) + n + \ell)$
time, where $\lambda > 2$ is the size of the LTSS and $\ell > 2$ is the number
of pairs of positions in $F$ that contain the same letter. Hence when
$\ell = o(n^2(\log\log n)^2/(\log n)^2)$ and
$\lambda = o(n(\log\log n)^2/((\log(\min\{\lambda, \ell/\lambda\}))(\log n)^2))$
our algorithm becomes more efficient. Thus our algorithm is most efficient
when the size of the LTSS is small. The extreme case in favor of our algorithm
occurs when all the letters in $F$ are distinct. In this case our algorithm is
actually linear, i.e., $O(n)$ time and space. This particular case is trivial
but a similar situation occurs when the alphabet size is large, i.e., polylog.
This is also the case where the original Hunt-Szymanski algorithm obtains its
best performance.

One important contribution of this work is the relation between the
LTSS and the decremental and incremental string comparison algorithms.
This relation is straightforward but seems to have remained unnoticed⁴, as
the algorithm of Ishida et al. [11] also depends on $\lambda$ and could thus
be used to compute the LTSS, if the structure was also decremental. On the
other hand the structure of Kim and Park [14] is decremental but does not
depend on $\lambda$. This relation was indeed explored in the work of
Tiskin [20], but as mentioned above the resulting algorithm is also not
dependent on $\lambda$. Hence for the case of a single string the algorithm we
presented in Section 3 yields competitive results. In fact it is very
interesting to compare the $T_k$ lists of the Hunt-Szymanski algorithm to the
incremental data structure of Ishida et al. [11]. In essence their structure
consists of expanded $T_k$ lists, where each element is repeated several times
so that the list becomes size $n$. This increases the space requirements but
makes navigating the lists and across lists more convenient. Also it forces
one of the operations to be $O(n)$ and thus the overall bound is always
$O(n^2)$ instead of $O(n\lambda)$.

The recent work by Inoue, Inenaga, and Bannai [10] follows essentially
the same approach as this paper. It solves the LTSS problem by resorting to
the same decremental string comparison approach and solves this problem
using the Hunt-Szymanski reduction to a LIS problem. A dynamic version of
the LIS problem that supports the ExtractMin operation is also considered.
In fact we were unaware of the similarity of their approach until recently.
Still our approach contains several key insights which allow us to obtain a
result that is competitive against their
$O(\min\{n, \ell\}\lambda(1 + \log(\ell/\lambda)) + n + \ell\log n)$
time bound⁵. They use essentially the dynamic LIS structure of Chen, Chu,
and Pinsker [4] and propose only one improvement, batched ExtractMin
operations. This means that they obtain good performance for a sequence
of ExtractMin operations.
We obtain the same improvement by using our duplicate discarding approach.
Therefore a single ExtractMin operation on our data structure corresponds
to several on theirs, because their ExtractMin operation removes
one duplicate at a time whereas ours removes all. Hence our ExtractMin
operation requires the same time as their batch of ExtractMin. One very
important optimization of our approach is using the finger trees to represent
the $T_k$ lists, which leads to the improved performance of the AppendBatch
operation, Algorithm 5. This implies that there is no $O(\log n)$ factor
associated to the $\ell$ term in our complexity. This term is most likely to
dominate the overall complexity in several interesting cases and in our
algorithm it is $O(\ell)$ whereas in theirs it is $O(\ell\log n)$. However
this is a tradeoff as we obtain an extra $O(n\lambda)$ term, whereas theirs is
only $O(n)$. Ignoring the first term, the resulting comparison is between
$O(n + \ell\log n)$ for their algorithm and $O(n\lambda + \ell)$ for ours.
Hence our algorithm obtains better performance provided that
$\ell = \Omega(n(\lambda - 1)/((\log n) - 1))$ and $n < \ell$ because of the
first term. Let us consider a very simple example where this is likely to
happen.

⁴ It was also recently pointed out by Inoue, Inenaga, and Bannai [10].

⁵ Note that we added an $O(n)$ term to their complexity result, because in the
case that we considered when all the letters of $F$ are distinct we have
$\ell = 0$, but their algorithm still requires $O(n)$ time, as does ours.


Assume that the letters of $F$ are obtained independently and uniformly
at random from an alphabet of size $\sigma$. In this case $\ell$ is expected
to be $(n/\sigma)^2$ and the distribution is highly concentrated around this
value. Hence in order for our algorithm to obtain the best theoretical bound
it is necessary to have an alphabet size $\sigma$ that is smaller than
$\sqrt{n((\log n) - 1)/(\lambda - 1)}$. This is actually a very loose bound,
much larger than poly logarithmic alphabets. Hence our algorithm's theoretical
bound yields the best performance for most alphabets, except for exceedingly
large ones. Even in extremely large alphabets there is the mitigating
expectation that $\sigma$ and $\lambda$ have an inverse relation, meaning that
larger values of $\sigma$ should yield smaller values of $\lambda$.
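The threshold on the alphabet size follows from a short calculation, sketched here under the simplifying assumptions that constant factors are ignored, that $\lambda > 1$ and $\log n > 1$, and that $\ell$ is replaced by the expected value $(n/\sigma)^2$ stated above:

```latex
\begin{align*}
n\lambda + \ell \le n + \ell\log n
  &\iff \ell \ge \frac{n(\lambda - 1)}{\log n - 1}, \\
\left(\frac{n}{\sigma}\right)^{2} \ge \frac{n(\lambda - 1)}{\log n - 1}
  &\iff \sigma \le \sqrt{\frac{n(\log n - 1)}{\lambda - 1}} .
\end{align*}
```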


The final improvement on the work of Inoue, Inenaga, and Bannai [10]
is the analysis with Lagrange multipliers that yields the
$\log(\min\{\lambda, \ell/\lambda\})$ bound that improves on the previous
$\log(\ell/\lambda)$ complexity. Also we believe that future research on these
data structures will focus precisely on this factor. One approach that seems
promising is to store the $T_k$ lists in a data structure that supports the
dynamic fractional cascading technique of Chazelle and Guibas [3], potentially
reducing this factor to $O(\log\log n)$.


The final contribution of this paper is the data structure to maintain
a dynamic LIS. Our approach uses a couple of nuances that allow us to
obtain Theorem 2. Using the AppendBatch of Algorithm 5 the complexity
of the original Hunt-Szymanski algorithm drops to $O(\ell + n\lambda)$, which
is never more than $O(n^2)$ and sometimes much better. Given the importance of
this algorithm, similar improvements have already been proposed by [1]. Still
our work provides a fairly simple alternative.


A LIS Retrieval


To simplify the exposition and the analysis we have thus far omitted how to
retrieve the actual LIS. This process can be supported by augmenting our
data structure, as we now explain.

Recall the $L$ sequences that occur in our running example. At the top of
Figure 5 we show these sequences, numbered $L_1$, $L_2$ and $L_3$,
corresponding to the pairs of strings $(P, S)$, $(P', S')$ and $(P'', S'')$.
To retrieve the elements from the list $L$ we need to index them. For $L_1$
this is straightforward: we simply number the elements from 1 to 10. However
when $L_1$ changes to $L_2$ and the number 1 is removed, we do not re-index
the sequence. A gap is left at position 3. Likewise when $L_2$ changes to
$L_3$ a gap is left at position 2. Position 12 can be re-used because it was
the last position of $L_2$.

To retrieve the sequences we augment the elements inside each $T_k$. Each
element stores a value $i$ of $L$ and a list which contains the positions
where $i$ occurs. These lists must contain at least one such position, but may
contain more than one. The lists contain other positions precisely to avoid
repeated elements in a $T_k$. To support LIS retrieval duplicated elements are
not dropped, instead their positions are stored in these lists. Figure 5 shows
this structure for $L_2$, where the list $T_1$ contains the element 2 and the
position list 2,12. Note that these position lists can be stored in increasing
order, for each element. Moreover the concatenation of these lists, for a
fixed $T_k$, is also sorted in increasing order. We refer to these global
lists as $P_k$. Hence for $L_2$ we have $P_1 = 1,2,12$ and $P_2 = 4,5,6,7$ and
$P_3 = 8,9,10$ and $P_4 = 11$.

Figure 5: Top: Example dynamic list $L$. Middle: Augmented threshold
data structure for $L_2$. Bottom: Longest increasing sequences for $L_2$ in
the order produced by Algorithm 6.
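The augmented lists can be illustrated with a small sketch of the Append operation in which a duplicate value is not inserted again and only its position is recorded. The values of $L_2$ used below are inferred from the $T_k$ and $P_k$ sets listed in Appendix B, and the structure is rebuilt here using Append only, whereas in the running example $L_2$ is reached after an ExtractMin. Plain Python lists stand in for the BSTs, and the names `append`, `L2`, `T` and `P` are choices made for this illustration.

```python
def append(levels, value, pos):
    """Append `value` at position `pos`.

    levels[k] holds [value, positions] pairs in strictly decreasing value
    order and corresponds to level k + 1 of the text.
    """
    # Binary search for the first level whose minimum (last entry) is >= value.
    lo, hi = 0, len(levels)
    while lo < hi:
        mid = (lo + hi) // 2
        if levels[mid][-1][0] >= value:
            hi = mid
        else:
            lo = mid + 1
    if lo == len(levels):
        levels.append([[value, [pos]]])    # new level: the LIS grew by one
    elif levels[lo][-1][0] == value:
        levels[lo][-1][1].append(pos)      # duplicate: store only its position
    else:
        levels[lo].append([value, [pos]])  # new minimum of that level


# Position -> value for L_2 (position 3 is the gap left by the removed element).
L2 = {1: 8, 2: 2, 4: 6, 5: 5, 6: 4, 7: 3, 8: 6, 9: 5, 10: 4, 11: 8, 12: 2}

levels = []
for pos in sorted(L2):
    append(levels, L2[pos], pos)

T = [[v for v, _ in level] for level in levels]
P = [[p for _, ps in level for p in ps] for level in levels]
```

Under these assumptions `P` reproduces the lists $P_1$ to $P_4$ given above, with position 12 stored next to position 2 in the single entry for the value 2.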

The possible sequences for our problem are shown in the bottom part
of Figure 5. Each sequence is obtained by choosing one element from $T_1$,
$T_2$, $T_3$ and $T_4$, in general one element from $T_1$ to $T_\lambda$. The
sequences must be increasing in the values chosen from $T_k$ and also in the
values chosen from $P_k$. To guarantee that searching through these lists
always yields a sequence of size $\lambda$ the procedure starts with
$k = \lambda$ and proceeds to decrease $k$. This is illustrated by the big
arrow in Figure 5.


In our example we start at $T_4$ and choose the value 8 with corresponding
position 11. Now we aim to determine which elements of $T_3$ may occur in
LIS sequences that terminate at 8. To determine the first such element we
can compute Predecessor($T_3$, $8 - 1$), i.e., find the largest value in $T_3$
that is strictly smaller than 8. Likewise the last such element in $P_3$
should be Predecessor($P_3$, $11 - 1$), i.e., it must occur in a position
strictly smaller than 11. Recall that $T_3$ is stored in decreasing order and
$P_3$ in increasing order. Therefore these predecessors define the interval of
valid element choices for a LIS. In general the interval of interest for a
given element $(i, p)$ of $T_{k+1}$ is between Predecessor($T_k$, $i - 1$) and
Predecessor($P_k$, $p - 1$). The Predecessor on $T_k$ is obtained from the BST
for $T_k$ in $O(\log\ell)$ time. The Predecessor on $P_k$ is conceptual and it
is enough to verify that $p' < p$, where $p'$ is the current position in
$P_k$. We illustrate these operations with arrows in Figure 5. The dashed
lines are used for $P_k$ and the filled lines for $T_k$. The figure also
illustrates the interval for element 5 of $T_3$, i.e.,
Predecessor($T_2$, $5 - 1$) and Predecessor($P_2$, $9 - 1$). Moreover it also
shows the interval for element 4 of $T_2$, i.e., Predecessor($T_1$, $4 - 1$)
and Predecessor($P_1$, $6 - 1$). Therefore, iterating the
Predecessor($T_k$, $i - 1$) operations, we can obtain the lexicographically
largest LIS in $O(\lambda\log\ell)$ time. To obtain the remaining LIS we
traverse these intervals, yielding a new LIS for each position that is
visited. Algorithm 6 details this procedure.


The arguments for RECURSIVEGETLIS are, respectively, a stack $S$,
which starts empty, a value for $k$, a value $i$ of $L$ in $T_{k+1}$ and a
corresponding position $p$ in $P_{k+1}$. The Predecessor operation is extended
to return the positions $p$, besides the $i$ values. Since the operation is on
$T_k$ it returns the smallest $p$ for the corresponding $i$; in our example
Predecessor($T_1$, $4 - 1$) returns $(2,2)$ instead of $(2,12)$. If the
corresponding $T_k$ is empty then it returns $(-\infty, +\infty)$. Moreover
for this algorithm we also use the Next operation, which behaves as an
iterator and returns an $(i', p')$ pair. It returns the next element, for
example Next($P_1$) returns $(2,12)$, assuming it is the first invocation
after Predecessor($T_1$, $4 - 1$). If there is no such element it returns
$(+\infty, +\infty)$. Assume that $T_k$ is represented as a BST and $P_k$ is
divided into lists, each inside a node of the BST as shown in the middle of
Figure 5. The Next operation either moves to the next element in the current
list, or to the next node on the BST, when it reaches the end of the current
list. Note that by next on the BST we mean a smaller value of $i$, as the
$T_k$ are stored in decreasing order. Moving to the next element on a list
requires constant time, but finding the next element on the BST may require
$O(\log\ell)$ time. Hence Algorithm 6 obtains each LIS in
$O(\lambda\log\ell)$ time, which again can be reduced to
$O(1 + \lambda\log(\ell/\lambda))$ by Jensen's inequality.

Algorithm 6 GetLIS()
Require: $L$ is not empty
    RECURSIVEGETLIS($\emptyset$, $\lambda + 1$, $+\infty$, $+\infty$)

    procedure RECURSIVEGETLIS($S$, $k$, $i$, $p$)
        if $k = 0$ then
            return $S$                         ▷ Found a LIS
        else
            $(i', p') \leftarrow$ Predecessor($T_k$, $i - 1$)
            while $p' < p$ do
                Push($S$, $p'$)
                RECURSIVEGETLIS($S$, $k - 1$, $i'$, $p'$)
                Pop($S$)
                $(i', p') \leftarrow$ Next($P_k$)
            end while
        end if
    end procedure
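The traversal can be sketched in a few lines on the example $L_2$. In this illustration each level is flattened into a list of (value, position) pairs, sorted by decreasing value and increasing position, so that Predecessor becomes a binary search and Next becomes an index increment. The sketch starts the recursion at level $k = \lambda$, so that the sentinel pair $(+\infty, +\infty)$ plays the role of an element of $T_{\lambda+1}$, which is how the arguments are described above. The names `LEVELS` and `get_lis` are assumptions of this example, and the flattened lists do not reproduce the BST representation or its time bounds.

```python
from bisect import bisect_right
from math import inf

# Levels T_1..T_4 of the example L_2, flattened to (value, position) pairs.
LEVELS = [
    [(8, 1), (2, 2), (2, 12)],
    [(6, 4), (5, 5), (4, 6), (3, 7)],
    [(6, 8), (5, 9), (4, 10)],
    [(8, 11)],
]


def get_lis(levels):
    """Yield every LIS as a list of positions, in the order of the recursion."""
    # Negated values are increasing, which lets bisect act as Predecessor.
    neg = [[-v for v, _ in level] for level in levels]
    stack = []

    def rec(k, i, p):
        if k == 0:
            yield stack[::-1]             # positions were pushed from the top level down
            return
        level = levels[k - 1]
        j = bisect_right(neg[k - 1], -i)  # first entry with value < i
        while j < len(level) and level[j][1] < p:
            stack.append(level[j][1])
            yield from rec(k - 1, level[j][0], level[j][1])
            stack.pop()
            j += 1                        # Next: smaller value or later position

    yield from rec(len(levels), inf, inf)
```

On this input the first sequence produced is the list of positions 2, 5, 8, 11, which holds the values 2, 5, 6, 8, and six sequences are produced in total.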


B Correctness


We focused mainly on the complexity of operations Append and ExtractMin.
However for completeness we will now establish that the procedures of
Algorithm 3 and Algorithm 4 are correct. Proving the correctness of the Append
operation per se is not essential as this was the procedure proposed by Hunt
and Szymanski [9] for determining a "static" LIS. Hence any sequence of
Append operations will compute a LIS. All we need to do is to verify the
interaction with the ExtractMin operation. This leads us to identify the
invariant properties of the $T_k$ lists. The first, more evident, properties
are related to the sorted order of the $T_k$ lists.


Lemma 1 Each $T_k$ list stores its elements sorted in decreasing order.

These Lemmas are proven using an argument of structural induction,
meaning that we assume that the property holds before a given application
of an Append or ExtractMin operation and only need to prove that after
the operation terminates the property is valid in the resulting data
structure. We use superscript b's to represent the structure before the
operation, meaning that a $T_k^b$ list represents a list before the operation
is applied. The list after the operation is applied is simply a $T_k$ list.
Also, for an element $i$ in $T_k$ we say that it is at level $k$.

Proof: Let us separate the analysis into the two operations: 


Append: before the element $i$ is added to list $T_k$ in line 12 a binary
search is used to determine $k$. This process guarantees that $i$ is smaller
than or equal to Min[$k$], i.e., smaller than or equal to the last element
of $T_k$. Our discarding approach means that if $i$ is equal it gets
discarded. Therefore the lists are kept in strictly decreasing order. The
order of the other lists remains unaltered.


ExtractMin: a $T_{k-1}$ list that results from this operation consists of the
concatenation of a prefix of the previous $T_{k-1}^b$ list with a suffix of a
$T_k^b$ list. This process is executed in lines 7 and 8 of Algorithm 4. The
main observation here is that the prefix of $T_{k-1}^b$ ends in a value
$n_{k-1}$, computed by Min($T_{k-1}$), and the suffix starts at a value $v_k$,
computed in line 6, and the Predecessor operation in this line guarantees that
$n_{k-1} \geq v_k$. Note that in our algorithm if $n_{k-1} = v_k$ the value
$v_k$ gets discarded. This means the overall list preserves its strictly
decreasing order.


Lemma 2 The sequence of the minimum elements of the $T_k$ lists is sorted
in increasing order.


Proof: Again let us separate the analysis into the two operations: 


Append: before the element $i$ is added to list $T_k$ in line 12 a binary
search is used to determine $k$. This process guarantees that $i$ is strictly
larger than Min[$k - 1$], i.e., larger than the last element of $T_{k-1}$.
Moreover it also guarantees that $i$ is smaller than or equal to Min[$k$],
which by hypothesis is strictly smaller than Min[$k + 1$]. Therefore the
increasing order of the last elements is maintained. The order of the
remaining last elements is preserved.


ExtractMin: this operation does not alter the relative order of the last
elements. It divides the $T_k^b$ lists and moves a suffix of $T_k^b$ to
$T_{k-1}$, this means that the minimum elements simply move to a lower level.
Hence we are only interested in the level $k$ where this process finishes,
i.e., $T_k^b = T_k$. Which means that the minimum element of this level
gets preserved. Moreover for all $k' > k$ their minimum elements also
get preserved and therefore their relative order remains an increasing
sequence. Hence we only need to verify two minimum relations,
between $T_{k-2}$ and $T_{k-1}$ and between $T_{k-1}$ and $T_k$. Note that
$T_{k-1}$ is only a prefix of $T_{k-1}^b$ because it does not pull elements
from $T_k^b$, this means that the minimum of $T_{k-1}$ and the minimum of
$T_{k-2}$ are actually two elements that occurred in the same $T_{k-1}^b$.
Because of Lemma 1 this implies that the minimum of $T_{k-2}$ is strictly
smaller than the minimum of $T_{k-1}$, because the latter occurs first in
$T_{k-1}^b$. Hence we are left with verifying the relation between the minimum
of $T_{k-1}$ and the minimum of $T_k$. This is precisely the relation that is
used in the while guard in line 5. Because we are considering the position
where the process terminates we have that the while guard is false,
which implies that Min($T_{k-1}$) < Min($T_k$). Hence the overall sequence of
minimum elements of the $T_k$ lists remains sorted in increasing order.


From Lemma 1 we can conclude that when an element $i$ is inserted into the
list $T_k$ it must be the smallest element on that list. Moreover from both
these Lemmas we can conclude that for any index $k$ the minimum element
of $T_k$ is strictly smaller than any element in $T_{k'}$ for any level
$k' > k$. In particular for $k = 1$ this implies that the last element of
$T_1$ is actually the overall minimum.
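The way ExtractMin interacts with these order properties can be sketched as follows. The sketch is reconstructed from the description used in the proofs above: the overall minimum is the last element of $T_1$, each affected level keeps a prefix and hands a suffix to the level below, equal values are merged, and the process stops at the first level where nothing moves. It is not a transcription of Algorithm 4, it uses plain Python lists with linear scans instead of finger trees, and therefore it does not achieve the stated bounds. The names `extract_min`, `example_levels` and `values` are hypothetical.

```python
from math import inf


def extract_min(levels):
    """Remove the overall minimum (last entry of level 1) and repair the levels.

    levels[k] holds [value, positions] pairs in strictly decreasing value order.
    """
    removed = levels[0].pop()
    k = 1
    while k < len(levels):
        lower, upper = levels[k - 1], levels[k]
        bound = lower[-1][0] if lower else inf   # minimum of the prefix that stays
        # First entry of the upper level with value <= bound (values decrease).
        cut = next((j for j, (v, _) in enumerate(upper) if v <= bound), len(upper))
        if cut == len(upper):
            break                                # nothing moves: higher levels are unaffected
        suffix = upper[cut:]
        del upper[cut:]
        if lower and suffix[0][0] == lower[-1][0]:
            lower[-1][1].extend(suffix[0][1])    # equal values: keep one entry, merge positions
            suffix = suffix[1:]
        lower.extend(suffix)                     # the suffix drops one level
        k += 1
    while levels and not levels[-1]:
        levels.pop()                             # discard levels that became empty
    return removed


def example_levels():
    """Structure for the example L_2, as [value, positions] pairs per level."""
    return [
        [[8, [1]], [2, [2, 12]]],
        [[6, [4]], [5, [5]], [4, [6]], [3, [7]]],
        [[6, [8]], [5, [9]], [4, [10]]],
        [[8, [11]]],
    ]


def values(levels):
    return [[v for v, _ in level] for level in levels]
```

In this sketch a single call removes the value 2 together with both of its positions, after which three levels remain, so the LIS of the remaining elements has size 3.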


This is the last of the order properties that is required for our study.
Now we need to analyze other properties of the structure. For this particular
analysis we strip away the layers of complexity that are related to the fact
that the $T_k$ lists are sorted. Instead we use sets of positions. Moreover
the $T_k$ lists are actually a compressed version of the position sets $P_k$;
by compressed we mean that our duplicate discarding approach implies that
there might not be a one to one correspondence between $T_k$ and $P_k$. We
will now use the position sets as they provide for simpler notation.


First we define a set $P$ of positions as a finite set of integers. The
integers do not have to be consecutive and gaps may occur. For example for
the list $L_2$ in Figure 5 the set $P$ consists of the integers from 1 to 12
except for 3. This set is then partitioned into the $P_k$ sets. Hence the
resulting sets are the following:


$P_1 = \{1, 2, 12\}$
$P_2 = \{4, 5, 6, 7\}$
$P_3 = \{8, 9, 10\}$
$P_4 = \{11\}$


In general the $P_k$ sets form a disjoint partition of $P$, meaning that
$P_k \cap P_{k'} = \emptyset$ for any indices $k \neq k'$ and
$P = P_1 \cup \ldots \cup P_\lambda$. A $T_k$ set can be obtained
from the corresponding $P_k$ set by accessing the underlying sequence $L$,
i.e., $T_k = \{L[p] \mid p \in P_k\}$. We use the notation $L[p]$ to refer to
the value in $L$ that corresponds to position $p$, for example using sequence
$L_2$ we have that $L[12] = 2$. The resulting $T_k$ sets for this example are
the following:


$T_1 = \{8, 2\}$
$T_2 = \{6, 5, 4, 3\}$
$T_3 = \{6, 5, 4\}$
$T_4 = \{8\}$


We identify two fundamental invariants for the $P_k$ sets. The first concerns
the order in the $P_k$ sets and the second is used to guarantee a LIS of size
$k$. These two properties are summarized in Lemma 3 and Lemma 4.


Lemma 3 Let $p \in P_k$ and $p' \in P_{k'}$ with $k < k'$ be positions from a
list $L$, if $L[p] > L[p']$ then $p < p'$.
Proof: 


Append: check Algorithm 3. Assume the element being inserted is $i$. We
only need to check the cases when $i = L[p]$ or $i = L[p']$, because if $p$
and $p'$ are any other positions they do not move from $P_k$ and $P_{k'}$
and their previous relation is preserved. Assume first that $i = L[p']$;
because the operation is an Append we have that $p'$ is the maximum
value inside $P$, therefore $p < p'$ and the implication holds because the
consequent is true. Now let us consider the case where $i = L[p]$, this
means that $p$ got inserted into $P_k$ and therefore $L[p]$ is the minimum
of $T_k$ (Lemma 1). Therefore we have that $L[p] < L[p']$, as was observed
just after the proof of Lemma 2. Therefore the antecedent is false and,
again, the implication holds true.

ExtractMin: check Algorithm 4. There are three relevant cases. Either $p$
was in level $k + 1$ before the operation or $p'$ was in level $k' + 1$ before
or both. Note that the case where both $p$ and $p'$ remain in the same level,
i.e., $p \in P_k^b$ and $p' \in P_{k'}^b$, is trivial.


First consider the cases where $p'$ moves from $k' + 1$ to $k'$; if
$k < k' + 1$ the property is preserved and therefore the implication holds.
The only tricky case is when $k = k' + 1$. If, on the other hand, $p$ stays
fixed at $k$ and $p'$ moves to $k' < k$ then the implication might change
but the Lemma still holds because the condition $k < k'$ is not satisfied.


Second let us consider the case where $p$ moves from $k + 1$ to $k$,
particularly when $k = k'$ and $p'$ stays fixed at $k'$. This time we
cannot copy the validity of the implication because it is not true
that $k + 1 < k'$ and therefore the validity of the implication might
be false because the Lemma condition was not satisfied. In this
case we have that $v_{k+1} \geq L[p]$, where $v_{k+1}$ is the value assigned
to the variable $v$ in line 6 of Algorithm 4 when processing list $T_{k+1}$.
Moreover, because the Predecessor operation is used, we have that
$n_k \geq v_{k+1}$, where $n_{k-1}$ is the value Min($T_{k-1}$) computed in
line 6 of Algorithm 4. Moreover because $n_k$ is a minimum and $p'$ remains at
level $k$ we have that $L[p'] \geq n_k$. Combining these inequalities yields
$L[p'] \geq n_k \geq v_{k+1} \geq L[p]$, which by transitivity implies
$L[p'] \geq L[p]$, thus making the antecedent of the implication false and
verifying its validity.


Lemma 4 For any position $p \in P_k$ with $1 < k$ there exists a position
$p' \in P_{k-1}$ such that $p' < p$ and $L[p] > L[p']$.


Proof: 


Append: Assume $p$ gets inserted into some $P_k$ with $k > 1$. Moreover let
$p' \in P_{k-1}$ be the position for which $L[p']$ is the minimum in
$T_{k-1}$. Then the observation after the proof of Lemma 2 guarantees that
$L[p] > L[p']$. Also note that the Append operation always inserts the
maximal $p$ value and therefore we have trivially that $p' < p$, thus
satisfying the conditions in the Lemma.


ExtractMin: We have two possibilities, either $p$ moved from level $k + 1$ to
level $k$ or it stayed at level $k$, meaning that it started in set $P_k^b$
and ended in the set $P_k$.

First consider that $p$ moves from level $k + 1$ to level $k$. By hypothesis
there is a $p' \in P_k^b$ that satisfies $p' < p$ and $L[p] > L[p']$. If this
position moves to level $k - 1$, i.e., $p' \in P_{k-1}$, then the Lemma is
verified. The alternative does not occur, as can be seen by contradiction.
Suppose that we have both $p$ and $p'$ in $P_k$, and $L[p] > L[p']$; according
to Lemma 3 this implies that $p < p'$, which contradicts the hypothesis
that $p' < p$.


Second consider that $p$ stayed at level $k$, meaning that
$L[p] > n_{k-1}$, where $n_{k-1}$ is, as in the previous proofs, the value of
Min($T_{k-1}$) obtained in line 6 of Algorithm 4. This holds precisely because
$n_{k-1}$ is used to determine $v_k$, which is used to split $T_k^b$. Let
$p' \in P_{k-1}$ be a position that justifies $n_{k-1}$, i.e.,
$L[p'] = n_{k-1}$. Hence we can write that $L[p] > L[p']$. If $p' < p$ then we
obtained our desired element. Otherwise we still have by hypothesis that there
is a $p'' \in P_{k-1}^b$ such that $p'' < p$ and $L[p] > L[p'']$. The only
problem would be if $p''$ also moved down to level $k - 2$. This does not
occur, as can be verified by contradiction. If $p''$ were to move we would
have $L[p'] = n_{k-1} > L[p'']$ and would conclude that $p' < p''$ by Lemma 3.
We thus apply transitivity to our contradiction hypothesis that $p < p'$ and
conclude that $p < p''$. This in turn contradicts the hypothesis that
$p'' < p$.


We can now prove that the data structure that we used for Theorem 1 is
correct. In essence we only need to show that if list $T_k$ is not empty then
the sequence $L$ contains a LIS of size at least $k$. This is simply a matter
of iterating Lemma 4 starting with some position $p$ where $L[p] \in T_k$.
This yields a list of $k$ positions $p_k = p, p_{k-1} = p', \ldots, p_1$, such
that $p_1 < \ldots < p_{k-1} < p_k$ and
$L[p_1] < \ldots < L[p_{k-1}] < L[p_k]$, where both conditions are
consequences of the Lemma.
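The iteration of Lemma 4 can be made concrete on the example $L_2$. The sketch below checks the implication of Lemma 3 on the position sets listed earlier and then walks down from a position in $P_k$ to level 1, picking at each step a witness position with a smaller index and a smaller value. The values of $L_2$ are inferred from the $T_k$ and $P_k$ sets above, and the names `lemma3_holds` and `witness_chain` are assumptions made for this illustration.

```python
# Position -> value for L_2 and its partition into the sets P_1..P_4.
L2 = {1: 8, 2: 2, 4: 6, 5: 5, 6: 4, 7: 3, 8: 6, 9: 5, 10: 4, 11: 8, 12: 2}
P = {1: [1, 2, 12], 2: [4, 5, 6, 7], 3: [8, 9, 10], 4: [11]}


def lemma3_holds(L, P):
    """For k < k', p in P_k and q in P_k': L[p] > L[q] must imply p < q."""
    ks = sorted(P)
    return all(p < q
               for a in ks for b in ks if a < b
               for p in P[a] for q in P[b]
               if L[p] > L[q])


def witness_chain(L, P, k, p):
    """Iterate Lemma 4 from position p in P_k down to level 1."""
    chain = [p]
    for level in range(k - 1, 0, -1):
        # Lemma 4 guarantees such a q exists; next() raises StopIteration otherwise.
        p = next(q for q in P[level] if q < p and L[q] < L[p])
        chain.append(p)
    return chain[::-1]
```

Starting from position 11 in $P_4$ the chain is 2, 5, 8, 11, whose values 2, 5, 6, 8 form an increasing sequence of size 4.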